This paper proposes to improve visual question answering (VQA) with structured representations of both scene contents and questions. A key challenge in VQA requires joint reasoning over the visual and text domains. The predominant CNN/LSTM-based approach to VQA is limited by monolithic vector representations that largely ignore structure in the scene and in the form of the question. CNN feature vectors cannot effectively capture situations as simple as multiple object instances, and LSTMs process questions as series of words, which does not reflect the true complexity of language structure. We instead propose to build graphs over the scene objects and over the question words, and we describe a deep neural network that exploits the structure in these representations. This shows significant benefit over the sequential processing of LSTMs. The overall efficacy of our approach is demonstrated by significant improvements over the state-of-the-art, from 71.2% to 74.4% in accuracy on the "abstract scenes" multiple-choice benchmark, and from 34.7% to 39.1% in accuracy over pairs of "balanced" scenes, i.e. images with fine-grained differences and opposite yes/no answers to the same question.
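To make the core idea concrete, the following is a minimal sketch (not the authors' exact architecture) of the two-graph setup the abstract describes: a graph over question words and a graph over scene objects, each updated by one round of neighbourhood aggregation, then pooled and combined to score candidate answers. The fully-connected edge structure, feature dimensions, pooling, and scoring head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate(node_feats: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One round of message passing: each node aggregates its neighbours' features."""
    messages = adj @ node_feats                  # sum of neighbour features
    degree = adj.sum(axis=1, keepdims=True) + 1e-8
    aggregated = messages / degree               # mean over neighbours
    return np.tanh(aggregated @ weight)          # transformed, updated node features

def pool(node_feats: np.ndarray) -> np.ndarray:
    """Collapse a graph into a single vector (mean pooling, for simplicity)."""
    return node_feats.mean(axis=0)

# Toy question graph: 5 word embeddings, fully connected (no self-loops).
word_feats = rng.normal(size=(5, 16))
word_adj = np.ones((5, 5)) - np.eye(5)

# Toy scene graph: 3 object feature vectors, fully connected.
obj_feats = rng.normal(size=(3, 16))
obj_adj = np.ones((3, 3)) - np.eye(3)

W_q = rng.normal(size=(16, 16))
W_s = rng.normal(size=(16, 16))

q_vec = pool(propagate(word_feats, word_adj, W_q))   # question representation
s_vec = pool(propagate(obj_feats, obj_adj, W_s))     # scene representation

# Joint representation scored against candidate answers (hypothetical linear head).
joint = np.concatenate([q_vec, s_vec])
W_ans = rng.normal(size=(joint.size, 4))             # 4 candidate answers
scores = joint @ W_ans
print("answer scores:", scores)
```

In the paper's full model the edges and update functions carry learned, structure-aware weights rather than the uniform mean aggregation used above; the sketch only illustrates why graph-structured inputs can represent multiple object instances and word relations that a single CNN vector or a word-by-word LSTM pass would conflate.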